[Dashboard][Serve LLM] Add NIXL KV transfer metrics to Serve LLM Grafana dashboard #60819

Open

kouroshHakha wants to merge 1 commit into ray-project:master from kouroshHakha:kh/nixl-metrics

Conversation

@kouroshHakha (Contributor)

Summary

  • Add 6 new Grafana panels to the Serve LLM dashboard for monitoring NIXL KV cache transfers in prefill-decode (P/D) disaggregated serving
  • Panels are inserted after the existing vLLM engine metrics and before the token summary panels, with all subsequent panel positions adjusted accordingly

New Panels

| Panel | Metric | Description |
| --- | --- | --- |
| NIXL: Transfer Latency | `ray_vllm_nixl_xfer_time_seconds` | Average RDMA transfer duration (ms) |
| NIXL: Transfer Throughput | `ray_vllm_nixl_bytes_transferred` / `ray_vllm_nixl_xfer_time_seconds` | Effective transfer bandwidth (GB/s) |
| NIXL: Transfer Rate | `ray_vllm_nixl_xfer_time_seconds_count` | KV transfers per second |
| NIXL: Avg Post Time | `ray_vllm_nixl_post_time_seconds` | Time to post/initiate a transfer (ms) |
| NIXL: KV Transfer Failures | `ray_vllm_nixl_num_failed_transfers` | Failed RDMA transfers (alerting) |
| NIXL: KV Expired Requests | `ray_vllm_nixl_num_kv_expired_reqs` | Requests whose KV blocks expired before decode consumed them (alerting) |

These metrics are emitted by vLLM's `NixlConnector` and wrapped via `RayPrometheusStatLogger` -> `RayKVConnectorPrometheus` -> `NixlPromMetrics`. The failure/expiration panels only show data when errors occur (the counters are lazily registered on first increment).
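For context on that last point, a minimal sketch of the lazy-registration pattern, assuming a `prometheus_client`-style API; `LazyCounter` and its fields are illustrative, not the actual `NixlPromMetrics` implementation:

```python
from typing import Optional, Sequence

from prometheus_client import Counter


class LazyCounter:
    """Counter that only registers with Prometheus on its first increment.

    Until .inc() is called, no series is exported, so dashboard panels
    backed by this metric stay empty while everything is healthy.
    """

    def __init__(self, name: str, documentation: str, labelnames: Sequence[str]):
        self._name = name
        self._documentation = documentation
        self._labelnames = tuple(labelnames)
        self._counter: Optional[Counter] = None  # not registered yet

    def inc(self, value: float = 1.0, **labels: str) -> None:
        if self._counter is None:
            # First increment: register with the default registry.
            self._counter = Counter(self._name, self._documentation, self._labelnames)
        self._counter.labels(**labels).inc(value)


# Usage: nothing is exported until the first failure is recorded.
failed_transfers = LazyCounter(
    "vllm_nixl_num_failed_transfers", "Failed NIXL RDMA transfers", ["model_name"]
)
failed_transfers.inc(model_name="my-model")
```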

Screenshots

[Two screenshots: the new NIXL panels rendering with live data in Grafana]

Test plan

Since the dashboard panels file is loaded at Ray startup and cannot be hot-reloaded on a running cluster, we used the following approach to validate the changes end-to-end:

1. Panel definition validation

```bash
cd ray && python -B -c "
from ray.dashboard.modules.metrics.dashboards.serve_llm_dashboard_panels import SERVE_LLM_GRAFANA_PANELS
print(f'Total panels: {len(SERVE_LLM_GRAFANA_PANELS)}')
for p in SERVE_LLM_GRAFANA_PANELS:
    print(f'  id={p.id:3d}  y={p.grid_pos.y:3d}  x={p.grid_pos.x:2d}  {p.title}')
"
```

Confirmed: 31 panels loaded (25 existing + 6 new NIXL), all IDs unique, GridPos layout correct.
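The uniqueness check can also be made mechanical with a short assertion (a sketch; it uses only the `id` attribute exercised above):

```python
from ray.dashboard.modules.metrics.dashboards.serve_llm_dashboard_panels import (
    SERVE_LLM_GRAFANA_PANELS,
)

# Panel IDs should be unique within a dashboard; fail loudly if any repeats.
ids = [p.id for p in SERVE_LLM_GRAFANA_PANELS]
dupes = {i for i in ids if ids.count(i) > 1}
assert not dupes, f"duplicate panel ids: {dupes}"
```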

2. Dashboard JSON generation

Since the installed ray package doesn't include the new panels yet, we generated the dashboard JSON by patching the module at import time:

```python
# generate_nixl_dashboard.py
import importlib.util
import sys

local_panels_path = "ray/python/ray/dashboard/modules/metrics/dashboards/serve_llm_dashboard_panels.py"
spec = importlib.util.spec_from_file_location(
    "ray.dashboard.modules.metrics.dashboards.serve_llm_dashboard_panels",
    local_panels_path,
)
local_panels_module = importlib.util.module_from_spec(spec)
# Shadow the installed module so downstream imports pick up the local panels.
sys.modules["ray.dashboard.modules.metrics.dashboards.serve_llm_dashboard_panels"] = local_panels_module
spec.loader.exec_module(local_panels_module)

from ray.dashboard.modules.metrics.grafana_dashboard_factory import _generate_grafana_dashboard

config = local_panels_module.serve_llm_dashboard_config
content, uid = _generate_grafana_dashboard(config)

with open("serve_llm_dashboard_nixl.json", "w") as f:
    f.write(content)
```
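After running the script, a quick sanity check on the output (a sketch; it assumes the generated JSON uses Grafana's top-level `panels` array and that the new panel titles start with "NIXL:" as in the table above):

```python
import json

with open("serve_llm_dashboard_nixl.json") as f:
    dash = json.load(f)

# Count the new panels by their title prefix.
nixl_panels = [
    p for p in dash.get("panels", []) if p.get("title", "").startswith("NIXL:")
]
print(f"NIXL panels in generated JSON: {len(nixl_panels)}")  # expect 6
```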

3. Live Grafana validation

  • Imported the generated JSON into Grafana on an Anyscale cluster running NIXL P/D disaggregation (8 prefill + 8 decode replicas with NixlConnector)
  • Added a ClusterId template variable to scope metrics to the active cluster (see the sketch after this list)
  • Confirmed all 6 NIXL panels render correctly with live data under load
  • Verified the existing 25 panels continue to work as expected
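For the ClusterId step, a sketch of injecting such a template variable into the generated JSON before import. It assumes Grafana's standard templating schema and that ClusterId is a label on the exported metrics; the `label_values` query is illustrative, not what was actually configured on the cluster:

```python
import json

with open("serve_llm_dashboard_nixl.json") as f:
    dash = json.load(f)

# Append a query-backed template variable whose values are discovered from
# the ClusterId label on one of the NIXL metrics (an assumption).
dash.setdefault("templating", {}).setdefault("list", []).append({
    "name": "ClusterId",
    "type": "query",
    "query": "label_values(ray_vllm_nixl_xfer_time_seconds_count, ClusterId)",
    "refresh": 2,  # re-query values when the time range changes
})

with open("serve_llm_dashboard_nixl.json", "w") as f:
    json.dump(dash, f, indent=2)
```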

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha requested a review from a team as a code owner on February 7, 2026 02:14
@kouroshHakha added the `go` label (add ONLY when ready to merge, run all tests) on Feb 7, 2026
@gemini-code-assist (bot) left a comment:

Code Review

This pull request adds valuable new Grafana panels to the Serve LLM dashboard for monitoring NIXL KV cache transfers. The changes are well-described and the test plan is thorough. I've identified a few areas for improvement in the new panel definitions to enhance consistency and correctness. Specifically, I'm suggesting a change to the throughput calculation to align with Grafana best practices, and updates to the failure/expiration panels to improve observability by including model_name in the aggregation.

Comment on lines +349 to +362:

```python
id=41,
title="NIXL: Transfer Throughput",
description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
unit="GBs",
targets=[
    Target(
        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
        legend="Throughput - {{model_name}} - {{WorkerId}}",
    ),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(12, 64, 12, 8),
```
@gemini-code-assist (bot), medium severity:

The current implementation for "NIXL: Transfer Throughput" has some inconsistencies: the expression calculates throughput in Gibibytes per second (GiB/s) using base-1024 division, while the description refers to GB/s (base-1000), and the unit GBs is non-standard in Grafana.

To align with Grafana best practices and improve clarity, I recommend removing the manual division from the expression and setting the unit to bytes/sec. Grafana will then automatically format the value with the appropriate SI prefix (e.g., KB/s, MB/s, GB/s), which is standard for data rates.

Suggested change:

```diff
 id=41,
 title="NIXL: Transfer Throughput",
-description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
-unit="GBs",
+description="NIXL KV cache transfer throughput (bytes transferred / transfer time).",
+unit="bytes/sec",
 targets=[
     Target(
-        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
+        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])',
         legend="Throughput - {{model_name}} - {{WorkerId}}",
     ),
 ],
 fill=1,
 linewidth=2,
 stack=False,
 grid_pos=GridPos(12, 64, 12, 8),
```

Comment on lines +403 to +404:

```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Failed Transfers - {{WorkerId}}",
```

@gemini-code-assist (bot), medium severity:

For better observability, it would be helpful to see failed transfers broken down by model_name, especially when multiple models are served. The current query filters by model_name but aggregates failures across all selected models. Please consider adding model_name to the sum by clause and the legend.

Suggested change:

```diff
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="Failed Transfers - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="Failed Transfers - {{model_name}} - {{WorkerId}}",
```

@eicherseiji (Contributor): +1 for the legend

Comment on lines +419 to +420:

```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="KV Expired - {{WorkerId}}",
```

@gemini-code-assist (bot), medium severity:

Similar to the transfer failures panel, it would be beneficial to see expired requests per model_name for more granular monitoring. The current query aggregates these across all selected models. Please consider adding model_name to the sum by clause and the legend.

Suggested change:

```diff
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="KV Expired - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="KV Expired - {{model_name}} - {{WorkerId}}",
```

@eicherseiji (Contributor): +1 for the legend

@cursor (bot) left a comment:
Cursor Bugbot has reviewed your changes and found 1 potential issue.

unit="ms",
targets=[
Target(
expr='rate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n* 1000',

Missing aggregation in NIXL latency/throughput PromQL queries

Medium Severity

The NIXL Transfer Latency (panel 40), Transfer Throughput (panel 41), and Avg Post Time (panel 43) panels divide rate() expressions without using sum by(model_name, WorkerId) aggregation. All other average calculations in this file (e.g., lines 55, 91, 163, 227) follow the pattern sum by(model_name, WorkerId) (rate(..._sum...)) / sum by(model_name, WorkerId) (rate(..._count...)). Without aggregation, if metrics have additional labels beyond model_name and WorkerId, Prometheus will perform element-wise division which may produce cluttered graphs or no data when label sets don't match exactly between numerator and denominator.

Additional Locations (2)
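A sketch of the aggregated form the bot describes, applied to the latency panel's target (it mirrors the sum-by pattern quoted above and is not the exact file contents):

```python
Target(
    # Aggregate numerator and denominator to the same label set before
    # dividing, so extra labels cannot break the element-wise match.
    expr='sum by (model_name, WorkerId) (rate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))\n/\nsum by (model_name, WorkerId) (rate(ray_vllm_nixl_xfer_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))\n* 1000',
    legend="Transfer Latency - {{model_name}} - {{WorkerId}}",
),
```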


@eicherseiji (Contributor) left a comment:

lgtm. Suggest replacing workerId with replicaId and including the model name in legends.

